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<n : 

An efficient numerical algorithm is presented for massively parallel simulations of dispersion- 
managed wavelength-division-multiplexed optical fiber systems. The algorithm is based on a weak 
O^l ' nonlinearity approximation and independent parallel calculations of fast Fourier transforms on mul- 

. tiple CPUs. The algorithm allows one to implement numerical simulations M/2 times faster than 

■ a direct numerical simulation by a split-step method, where M is a number of CPUs in parallel 
Pj^ ' network. 

^ ! OCIS codes: 060.2330, 060.5530, 060.4370, 190.5530, 260.2030. 

A wavelength-division-multiplexed (WDM) dispersion-managed (DM) optical fiber system is the focus of current 
<~ ~ i research in high-bit-rate optical communications. High capacitx-pf optical transmission is achieved using both wave- 
length multiplexing and dispersion management. (See e.g. Ref.Ero). Wavelength multiplexing allows the simultaneous 
transmission of sevesal information channels, modulated at different wavelengths, through the same optical fiber. A 
dispersion-manageduu optical fiber systems are designed to achieve low (or even zero) path-averaged group- velocity 
^Ih 1 dispersion (GVD) by periodically alternating the sign of the dispersion along an optical fiber. This dramatically 
reduces pulse broadening. Second-order GVD (dispersion slope) effects and path-averaged GVD effects cause optical 
pulses in distinct WDM channels to move with different group velocities. Consequently modeling of WDM systems 
_ requires simulating a long time interval. Enormous computation resources are necessary to capture accurately the 
nonlinear interactions between channels which deteriorates bit-rate capacity. The large computational resources 
required to simulate WDM transmission over transoceanic distances make parallel computation necessary. Here an 
efficient numerical algorithm is developed for massive parallel computation of WDM systems. The required com- 
putational time is inversely proportional to the number of parallel processors used. This makes feasible a full scale 
numerical simulation of WDM systems on a workstation cluster with a few hundred processors. 
£sq \ Neglecting polarization effects and stimulated Raman scattering and Brillouin scattering, the propagation of WDM 

■ optical pulses in a DM fiber is described by a scalar nonlinear Schrodinger equation (NLS): 
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= iG(z)A, (1) 

where z is the propagation distance along an optical fiber, A(t, z) is the slow amplitude of light; 02 and f3 3 are the 
first and second-order GVD respectively which are periodic functions of z\ a — (27m 2 )/(Ao^4 e //) is the nonlinear 
coefficient; ri2 is the nonlinear refractive index; Ao = i.55/im is the carrier wavelength; A e ff is the effective fiber 
area; z^ — kz a (k = 1, . . . ,N) are amplifier locations; z a is the amplifier spacing; and 7 is the loss coefficient. Note 
that distributed amplification can be also included in G(z) without changing the following analysis. 

— f Z G(z')dz' 

The change of variables u = Ae Jo results in the NLS with the z-dependent nonlinear coefficient c(z) = 

a(z) exp (2 J Q Z G(z')dz') : 

i i 

iu z - -(32{z)u tt - -(3 3 (z)u ttt + c(z)\u\ 2 u = 0. (2) 



'Accepted to Optics Letters (2001) 



1 



By applying Fourier transform u(u),z) = J"^^ z)e lu)t dt to Eq. (Q), changing variables u(u),z) = 
-0(w,z)exp (i J* dz'[uj 2 foiz') + ^(3a{z')fj and integrating Eq. (^) over z from zq to z one obtains the following 
integro-differential equation: 

4>(w,z) = tp(oj,z ) + 

iR(ip[uj,z} 1 uj,z,ZQ), (3) 



where 



x exp 



R(v[w,z],to,z,zo) = J2^y J du>idv2du>3 j dz' 
x c(z') 0<*'>(wi, 2>)&"'\u> 2 , z')v* W (w 3 , z') 

X(5(^i + w 2 - w - W3)) 
v*- 2 -* (w, z) = z) 

x exp (| f dzVftCO + ^W)]) ■ (4) 



If the nonlinearity is small: 3> z,n spi where z„; = l/|p| 2 is a characteristic nonlinear length, z^isp = 7" 2 / 1 /?2 1 
is the dispersion length; p and r are typical pulse amplitude and width respectively. Then one can conclude that 
ip(tu, z) is a slow function of z on any scale L <C z n i because all of the fast dependence of u is already included in 
the term exp J* dz' '[uo 2 fhiz ') + ^P3(z')]\ (see referenceB~i) . This term is nothing more than an exact solution 

of the linear part of Eq. (|J) . In first approximation one can neglect the slow dependence of ip on z in the interval 
mL < z < (m + 1)L, i.e. one can replace ip[uj, z] by tp[ui, mL] in the nonlinear term R (m is an arbitrary nonnegative 
integer number). This substitution allows to rewrite Eq. (|3|) in the following form: 

V>(w, (m + 1)L) — mL) + 
iR(<ip[u, mL],w, (m + 1)L, mL) + 0{— ) 2 . (5) 

Znl 

The term 0(L/z n i) 2 indicates the order of accuracy of this approximation. Eq. (0) enables one to find "4>(u>, (m + l)L) 
given ip(uj,mL). Thus one can recover recover u(t, z) using the definition of ip. However for WDM simulation, the 
accuracy 0(L/z n i) 2 is not always sufficient. The next order approximation is obtained by including the first order 
correction, i/)W(w,mi), in the nonlinear term, R: 

(m + 1)L) = i/j(l>, mL) 

+iR(tP^[Lu,z},uj,z,mL) + 0( — ) 3 , (6) 

Znl 

^^(uj, z) = "4>{ui, mL) + iR(ip[u;, mL], lu, z, mL). (7) 

Equations (0), (|^), and (^) form a closed set for the approximate calculation of V>(u>, (m + l)L) given ^(w,ml), 
where 0{-^-) is the accuracy of the approximate solution which is controlled by the appropriate choice of L. 
The main obstacle in the numerical integration of Eqs. (^), ([|), and (^) is the computation of the integral term 
R(i)[u!, z],uj, z,mL^ which generally requires M x N 3 operations for each iteration, where N is the number of grid 
points in u> or i-space and M is the number of grid points for integration over z. Next one presents a very efficient 
numerical algorithm for calculations R(y[ui, z], u>, z, mL). 

In i-space Eq. (0) becomes 
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F- x (R(v[uj],u),z,mL)^ 

dz'c(z')G^'\V {z '\t,z')), (8) 

where F -1 is the inverse Fourier transform over to; V^(t, z) = \v^ z '(t, z)\ 2 v^ z \t, z) and is the integral operator 
corresponding to the multiplication operator G^ z ^ (^^(u), zj) = exp ^ — | J* dz'[uj 2 foiz') + ^/^(z')]^ V^(u>, z) in 

the w-space. It follows from Eqs. (|j), (§) — (||) that the numerical procedure for calculation of R(A,u>) requires the 
following eight steps: 

(i) The Inverse Fourier Transform of v^(uj,mL) — -0(oj,mi)exp dz'[u> 2 ^{z') + ^/^(z')]) for every value 

of z (mL < z < (m + 1)L). 

(ii) A calculation of (t,mL) from (t,mL). 

(iii) The Forward Fourier Transform of V^ z ' (t,mL). 

(iv) A numerical integration (summation) of c(z') exp ( — | J * dz'[L0 2 ^{z') + ^Pz{z')\^ V^ z '(w) over z' (from 

z' = mL to z' = z) for every values of lo and z (mL < z < (m + 1)L). This integration gives i/j^(uj, z) according to 
Eq. (@). 

(v) The Inverse Fourier Transform of v^ z \oj, z) = ip^(u>, z) exp dz'[oj 2 ^{z') + ^(3 3 (z')]j for every value of 
z, mL < z < (m + 1)L. (Note that in contrast with step (i) it is necessary to take into account the dependence of 
ffl 1 ' on z). 

(vi) -(viii) These steps are similar to steps (ii)-(iv) except that the new value of v^(u>, z) is used which was obtained 
in step (v). 

The forward and inverse Fourier transforms can be performed with the fast Fourier transform (FFT) which requires 
NLog2(N) numerical operations. Steps (i)-(iii) need only the value of (/i(t,mL). These steps can be performed 
independently and simultaneously in a network of M central processor units (CPUs), shown schematically in Fig. 
1. The number of CPUs, M, coincides with the number of grid points for integration over z. Thus the effective 
computational time equals to time necessary to perform 2NLog2{N) operations on complex numbers in one CPU. 
Below to estimate effective computational time one always refers to the number of numerical operations in one CPU 
if all calculations can be implemented simultaneously in different CPUs without communication between them. 

The resulting values of V^(t,mL) (after step (hi)) are a set of M vectors a m , (m = 1,2, .. . , M) consisting of 
N complex numbers each. Every vector a m is stored in the memory of the mth CPU (or in memory assigned to 
mth CPU in shared memory network). To perform step (iv) one replaces these vectors by the new vectors b m : 
b m = X)jli a j' ( m = 1,2,...,M). Here a simple parallel algorithm is given. Note that this algorithm can be 
improved but this improvement is outside the scope of this Letter. It is assumed that M is a power of 2: M = 2 M °, 
M e is an integer. The proposed algorithm requires M e substeps. The vectors b™ , (m = 1,2,...,M) are results 
of fcth substep stored in memory. So that = b m . The first substep is to sum up every pair of vectors: 

&2m + a 2m+ i to get = ai, = ai + a 2 , . . . , b^_ 1 = a M -i, b^ = a M -i + elm- This summation requires 
N operations. By induction one can see that after k substeps, bm' = X)JLi a j f° r 1 < m < 2 fc , bm = X)JL2 fc +i a i 
for 2 fe + 1 < m < 2 k + 2 k , . . . , b^ } = Ejl 2 «.= -2*+i a i for 2M " - 2 fe + 1 < m < 2 M ° . Note that M vectors are 
now grouped in M/2 k blocks with the appropriate summation inside each block. To perform the k + 1th substep, 
it is necessary to double the block size. This can be done by adding the last element of each odd block to each 
element of next even block. To do this, one first creates in memory of 2 k copies of the last element of each odd 
block, which requires kN operations in a parallel CPU network. (A number of copies can be doubled by memory 
forking after each N operations). To complete the k + 1th substep, it is now enough to simultaneously add 2 k 
copies to each element in the even block, requiring N operations. The total number of operations for step (iv) is 
[1 + 2 + . . . M e ]N = M e (M e + l)/2. Steps (v)-(viii) can be done in about N[2Log 2 {N) + Log 2 (M)} operations. (In 
step (viii) it is only necessary to calculate bjv/ requiring NLog2[M] operations). Thus the total number of operation 
for steps (i)-(viii) is: 

N[4Log 2 (N) + Log 2 {M) + Log 2 (M) 
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x (Log 2 (M) + l)/2] ~ N[4Log 2 (N) + ^ ' ]. (9) 

Direct solution of (||) by a split-step method with the same accuracy (for the same size of numerical step, L/M, 
and the same number of points N in ui space) requires 2MNLog2(N) operations. Comparing this with (|9j), one 
can conclude that the proposed parallel algorithm allows one to do numerical simulations with the same numerical 
accuracy about M/2 times faster using a network of M parallel CPUs. However the proposed algorithm is about 2 
times slower if only one CPU is used. 

Numerical simulations of the WDM system were performed using both the split-step method for NLS (§) and using 
the numerical algorithm given by Eqs. (^), (Q), and (]?]) to demonstrate the accuracy of the proposed numerical scheme. 
Simulations were performed for 5 WDM channels (20 Gb/s per channel) over a typical transoceanic distance of 
10 4 km. The channel spacing was 0.6 nm. The GVD periodically alternates between spans of standard monomode fiber 
{fffl = —20.0ps 2 /km, (3^ = O.lps 3 /km, <j\ — 0.0013 (kmmW)~ Y , length L\ = 40 km) and dispersion compensating 
fiber (4 2) = 103.9ps 2 /fcm, pf ] = -0.3ps 3 /km, a 2 = 0.00405 (km mW)" 1 , length L 2 = -(3 ( 2 ] L x / 13 ( 2 ] km) so that 
the average GVD is zero. Fiber losses and amplifiers were not considered. However they can be easily included in 
the coefficient c(z). A pseudo-random binary sequence of length 20 was used for every WDM channel. The boundary 
conditions are periodic in time. Each binary "1" was represented by an initially zero-chirp Gaussian pulse (return 
to zero format) of lOps width and peak power \u\ 2 = 1 mW at the beginning (z — 0) of the fiber line which is taken 
at the middle of standard monomode fiber span. The integration length L (see Eqs. (§),(§), (0)) is set to be equal 
to (Li + L 2 )/4; M = 2 9 ; and N = 2 11 . Fig. 2 shows the pulse power distribution (simultaneously in all 5 channels) 
after propagating 10 i km obtained from both the split-step and the proposed parallel algorithm. The differences in 
power distribution between these two simulations are less than 1% so the two curves are indistinguishable in Fig. 2. 
Numerical simulations were performed on usual workstation without use of parallel computations. The objective of 
this numerical example is to demonstrate the relative accuracy of numerical algorithm. Hardware implementation of 
the parallel simulation for numerical algorithm (ji|), (||), and (^) is beyond the scope of this Letter. 

One can conclude that the proposed parallel numerical algorithm allows one to implement numerical simulations 
of Eq. (Q) about M/2 times faster than a direct numerical simulation of that Eq. by the split-step method with the 
same accuracy. The absence of communications between parallel CPUs during the computation of the FFT allows 
one to implement the proposed massive parallel algorithm on workstation clusters. 

The author thanks M. Chertkov, I.R. Gabitov and S. Tretiak for helpful discussions. 
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Figure captions: 



Fig.l. A schematic of parallel computation algorithm and required number of numerical steps. FFT\, FFT2, . . . 
represent FFT in first, second etc. CPUs respectively. Righ-hand-side schematically shows calculation of vectors h m 
(see text). 

Fig. 2. Power distributions of 5 WDM channels after propagation of pseudorandom sequences of Gaussian pulses over 
10 4 km. Only small part of the total computational interval of lOOOps is shown. 
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